Parts of the data.frame and ggplot2 material are based on the R for data analysis and visualization Data Carpentry course.
The file we’ll be reading in is a dataset that has been 1) processed in Skyline and 2) summarized by each run and protein with MSstats. We will practice with it.
Tip Often you’ll get data delivered as a Microsoft Excel file. You can export any spreadsheet to a .csv (comma separated values) file in Excel through the Save As.. > Format: Comma Separated Values (.csv) menu item.
In Rstudio, go to the environnment pane, click on Import Dataset dropdown and choose From Text File... from the dropdown menu. Import the iPRG_example_runsummary.csv file from your data directory, and inspect that Rstudio correctly parsed the text file into an R data.frame.
Now inspect the Console Environment pane again. Notice that a new variable for the iPRG_example data frame was created in the environment by executing the read.csv function. Let’s have a look at the documentation for this function by pulling up the help pages with the ?.
iprg <- read.csv("./data/iPRG_example_runsummary.csv")
The iprg object that we created is a data.frame
class(iprg)
## [1] "data.frame"
These object are the equivalent of a sheet in a spreadsheet file. They are composed on a set of columns, which are different vectors (or characters, numerics, factors, …) as seen previously.
There are actually some additional cont strains compared to a spreadsheet. Rather than being limitations, these constrains are an important feature that allow some standardisation and hence automatic computations.
All the data in a data.frame must be included in a column, as a vector. This means that it’s not possible to add random notes or values, as is sometimes seen in spreadsheets.
All columns/vectors must have the same length, as opposed to spreadsheets, where sometimes some values or summary statistics are added at the bottom.
No colours or font decorations.
This leads us to a very important concept in data formatting and data manipulation, which is that data should be tidy, where
There are two important reasons that we want tidy data
Note that data is always tidy, and for good reasons so. For example, omics data is often presented as shown below
| JD_06232014_sample1-A.raw | JD_06232014_sample1_B.raw | |
|---|---|---|
| sp|D6VTK4|STE2_YEAST | 26.58301 | 26.81232 |
| sp|O13297|CET1_YEAST | 24.71809 | 24.71912 |
| sp|O13329|FOB1_YEAST | 23.47075 | 23.37678 |
| sp|O13539|THP2_YEAST | 24.29661 | 27.52021 |
| sp|O13547|CCW14_YEAST | 27.11638 | 27.22234 |
which is not strictly tidy, as the protein intensity is presented along multiple columns. Some situations lend themselves more to a long or wide format (as we will see later), but the data should never be messy, as for example below:
Challenge
Compare the structure of the data presented above (loaded from the
iprg2.rdafiles) and theiprgdata.
Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.
A data frame can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web).
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because the column are vectors, they all contain the same type of data (e.g., characters, integers, factors). We can see this when inspecting the structure of a data frame with the function str():
str(iprg)
## 'data.frame': 36321 obs. of 7 variables:
## $ Protein : Factor w/ 3027 levels "sp|D6VTK4|STE2_YEAST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Log2Intensity: num 26.8 26.6 26.6 26.8 26.8 ...
## $ Run : Factor w/ 12 levels "JD_06232014_sample1_B.raw",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Condition : Factor w/ 4 levels "Condition1","Condition2",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ BioReplicate : int 1 1 1 2 2 2 3 3 3 4 ...
## $ Intensity : num 1.18e+08 1.02e+08 1.01e+08 1.20e+08 1.16e+08 ...
## $ TechReplicate: Factor w/ 3 levels "A","B","C": 2 3 1 1 2 3 1 2 3 2 ...
data.frame objectsWe already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!
dim(iprg) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow(iprg) - returns the number of rowsncol(iprg) - returns the number of columnshead(iprg) - shows the first 6 rowstail(iprg) - shows the last 6 rowsnames(iprg) - returns the column names (synonym of colnames() for data.frame objects)rownames(iprg) - returns the row namesstr(iprg) - structure of the object and information about the class, length and content of each columnsummary(iprg) - summary statistics for each columnNote: most of these functions are “generic”, they can be used on other types of objects besides data.frame.
Challenge
Based on the output of
str(iprg), can you answer the following questions?
- What is the class of the object
iprg?- How many rows and how many columns are in this object?
- How many proteins have been assayed?
Our data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the coordinates we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
iprg[1] # first column in the data frame (as a data.frame)
iprg[, 1] # first column in the data frame (as a vector)
iprg[1, 1] # first element in the first column of the data frame (as a vector)
iprg[1, 6] # first element in the 6th column (as a vector)
iprg[1:3, 3] # first three elements in the 3rd column (as a vector)
iprg[3, ] # the 3rd element for all columns (as a data.frame)
head_iprg <- iprg[1:6, ] # equivalent to head(iprg)
: is a special function that creates numeric vectors of integers in increasing or decreasing order, test 1:10 and 10:1 for instance.
You can also exclude certain parts of a data frame using the - sign:
iprg[, -1] # The whole data frame, except the first column
iprg[-c(7:36321), ] # Equivalent to head(iprg)
As well as using numeric values to subset a data.frame columns can be called by name, using one of the four following notations:
iprg["Protein"] # Result is a data.frame
iprg[, "Protein"] # Result is a vector
iprg[["Protein"]] # Result is a vector
iprg$Protein # Result is a vector
For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.
Challenge
Create a
data.frame(iprg_200) containing only the observations from row 200 of theiprgdataset.Notice how
nrow()gave you the number of rows in adata.frame?
- Use that number to pull out just that last row in the data frame.
- Compare that with what you see as the last row using
tail()to make sure it’s meeting expectations.- Pull out that last row using
nrow()instead of the row number.- Create a new data frame object (
iprg_last) from that last row.Extract the row that is in the middle of the data frame. Store the content of this row in an object named
iprg_middle.Combine
nrow()with the-notation above to reproduce the behavior ofhead(iprg)keeping just the first through 6th rows of theiprgdataset.
When we did str(iprg) we saw that several of the columns consist of numerics, however, the columns Protein, Run, and Condition, are of a special class called a factor. Factors are very useful and are actually something that make R particularly well suited to working with data, so we’re going to spend a little time introducing them.
Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.
Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels():
levels(sex)
## [1] "female" "male"
nlevels(sex)
## [1] 2
Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be:
sex # current order
## [1] male female female male
## Levels: female male
sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering
## [1] male female female male
## Levels: male female
In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing: "female", "male" is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the species names in our example dataset).
If you need to convert a factor to a character vector, you use as.character(x).
as.character(sex)
## [1] "male" "female" "female" "male"
stringsAsFactors=FALSEBy default, when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (=converted) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE.
In most cases, it’s preferable to set stringsAsFactors = FALSE when importing your data, and converting as a factor only the columns that require this data type.
Challenge
Compare the output of
str(surveys)when settingstringsAsFactors = TRUE(default) andstringsAsFactors = FALSE:
| dimensions | number of types | |
|---|---|---|
vector |
1 | 1 |
matrix |
2 | 1 |
array |
any | 1 |
data.frame |
2 | 1 per colums |
list |
1 (length) | any |
Let’s explore some basic properties of our dataset. Go to the RStudio Environment pane and double click the iPRG_example entry. This data is in tidy, long format, which is an easier data format for data manipulation operations such as selecting, grouping, summarizing, etc.
Data exported out of many omics processing or quantification tools are often formatted in wide format, which is easier to read when we would like to compare values (i.e intensity values) for specific subjects (i.e peptides) across different values for a variable of interest such as (i.e conditions). We’ll format a summary of this dataset as a ‘wide’ data frame later in this tutorial.
Let’s do some more data exploration by examining how R read in the iPRG dataset.
Challenge
Explore the data as described below
- What is the class of the variable?
- What dimension is it? How many rows and columns does it have?
- What variables (column names) do we have?
- Look at the few first and last lines to make sure the data was imported correctly.
- Display a summary of the whole data.
Let’s now inspect the possible values for the Conditions and the BioReplicate columns. To aswer the questions, below, we will need to use the unique function. From the manual page, we learn that
'unique' returns a vector, data frame or array like 'x' but with
duplicate elements/rows removed.
For example
unique(c(1, 2, 4, 1, 1, 2, 3, 3, 4, 1))
## [1] 1 2 4 3
unique(c("a", "b", "a"))
## [1] "a" "b"
dfr <- data.frame(x = c(1, 1, 2),
y = c("a", "a", "b"))
dfr
## x y
## 1 1 a
## 2 1 a
## 3 2 b
unique(dfr)
## x y
## 1 1 a
## 3 2 b
Challenge
- How many conditions are there?
- How many biological replicates are there?
- How many condition/technical replicates combinations are there?
It is often useful to start a preliminary analysis, or proceed with a more detailed data exploration using a smalle subset of the data.
Challenge
Select subsets of rows from iPRG dataset. Let’s focus on
- Condition 1 only
- Condition 1 and TechReplicate A
- all measurements on one particular MS run.
- Conditions 1 and 2
For each of there, how many measurements are there?
Make a histogram of all the MS1 intensities, quantified by Skyline, for iPRG_example.
hist(iprg$Intensity)
Our histogram looks quite skewed. How does this look on log-scale?
Do you recognise this distribution? The distribution for log2-transformed intensities looks very similar to the normal distribution. The advantage of working with normally distributed data is that we can apply a variety of statistical tests to analyse and interpret our data. Let’s add a log2-scaled intensity column to our data so we don’t have to transform the original intensities every time we need them.
hist(iprg$Log2Intensity,
xlab = "log2 transformed intensities",
main = "Histogram of iPRG data")
In this case, we have duplicated information in our data, we have the raw and log-transformed data. This is not necessary (and not advised), as it is straightforward to transform the data on the flight.
hist(log2(iprg$Intensity),
xlab = "log2 transformed intensities",
main = "Histogram of iPRG data")
We look at the summary for the log2-transformed values including the value for the mean. Let’s fix that first.
summary(iprg$Log2Intensity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.37 23.78 24.68 24.82 25.78 31.42
Challenge
Reproduce the histogram above but plotting the data on the log base 10 scale, using the
log10function. See also the more generallogfunction.
Boxplots are extremely useful because they allow us to quickly visualise the data distribution, without making assumptions of the distribution type (non-parametric). We can read up on what statistics the different elements of a box-and-whisker represent in the R help files.
boxplot(iprg$Log2Intensity)
The boxplot, however, shows us the intensities for all conditions and replicates. If we want to display the data for, we have multile possibilities.
by functionint_by_run <- by(iprg$Log2Intensity, iprg$Run, c)
boxplot(int_by_run)
boxplot(Log2Intensity ~ Run, data = iprg)
ggplot2 package that is very flexible to visualise data under different angles.ggplot2 packageggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties, so we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.
ggplot2Base graphics
Uses a canvas model a series of instructions that sequentially fill the plotting canvas. While this model is very useful to build plots bits by bits bottom up, which is useful in some cases, it has some clear drawback:
The grammar of graphics
The components of ggplot2’s of graphics are
Fist of all, we need to load the ggplot2 package
library("ggplot2")
ggplot graphics are built step by step by adding new elements.
To build a ggplot we need to:
data argumentggplot(data = iprg)
aes), by selecting the variables to be plotted and the variables to define the presentation such as plotting size, shape color, etc.ggplot(data = iprg, aes(x = Run, y = Log2Intensity))
geoms – graphical representation of the data in the plot (points, lines, bars). To add a geom to the plot use + operatorggplot(data = iprg, aes(x = Run, y = Log2Intensity)) +
geom_boxplot()
See the documentation page to explore the many available geoms.
The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot “templates” and conveniently explore different types of plots, so the above plot can also be generated with code like this:
## Assign plot to a variable
ints_plot <- ggplot(data = iprg, aes(x = Run, y = Log2Intensity))
## Draw the plot
ints_plot + geom_boxplot()
Notes:
Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x and y axis you set up in aes().
You can also specify aesthetics for a given geom independently of the aesthetics defined globally in the ggplot() function.
The + sign used to add layers must be placed at the end of each line containing a layer. If, instead, the + sign is added in the line before the other layer, ggplot2 will not add the new layer and will return an error message.
Challenge
- Repeat the plot above but displaying the raw intensities.
- Log-10 transform the raw intensities on the flight when plotting.
First, let’s colour the boxplot based on the condition:
ggplot(data = iprg,
aes(x = Run, y = Log2Intensity,
fill = Condition)) +
geom_boxplot()
Now let’s rename all axis labels and title, and rotate the x-axis labels 90 degrees. We can add those specifications using the labs and theme functions of the ggplot2 package.
ggplot(aes(x = Run, y = Log2Intensity, fill = Condition),
data = iprg) +
geom_boxplot() +
labs(title = 'Log2 transformed intensity distribution per MS run',
y = 'Log2(Intensity)',
x = 'MS run') +
theme(axis.text.x = element_text(angle = 90))
And easily switch from a boxplot to a violin plot representation by changing the geom type.
ggplot(aes(x = Run, y = Log2Intensity, fill = Condition),
data = iprg) +
geom_violin() +
labs(title = 'Log2 transformed intensity distribution per Subject',
y = 'Log2(Intensity)',
x = 'MS run') +
theme(axis.text.x = element_text(angle = 90))
Finally, we can also overlay multiple geoms by simply adding them one after the other.
p <- ggplot(aes(x = Run, y = Log2Intensity, fill = Condition),
data = iprg)
p + geom_boxplot()
p + geom_boxplot() + geom_jitter() ## not very usefull
p + geom_jitter() + geom_boxplot()
p + geom_jitter(alpha = 0.1) + geom_boxplot()
Challenge
- Overlay a boxplot goem on top of a jitter geom for the raw or log-10 transformed intensities.
- Customise the plot as suggested above.
Finally, a very useful feature of ggplot2 is facetting, that defines how to subset the data into different panels (facets).
names(iprg)
## [1] "Protein" "Log2Intensity" "Run" "Condition"
## [5] "BioReplicate" "Intensity" "TechReplicate"
ggplot(data = iprg,
aes(x = TechReplicate, y = Log2Intensity,
fill = Condition)) +
geom_boxplot() +
facet_grid(~ Condition)
You can save plots to a number of different file formats. PDF is by far the most common format because it’s lightweight, cross-platform and scales up well but jpegs, pngs and a number of other file formats are also supported. Let’s redo the last barplot but save it to the file system this time.
Let’s save the boxplot as pdf file.
pdf()
p + geom_jitter(alpha = 0.1) + geom_boxplot()
dev.off()
The default file name is Rplots.pdf. We can customise that file name specifying it by passing the file name, as a character, to the pdf() function.
Challenge
Save a figure of your choice to a pdf file. Read the manual for the
pngfunction and save that same image to a png file.Tip: save your figures in a dedicated directory.
Finally, we can save this whole session you worked so hard on! We can save individual variables using the save function, or save the complete environment with save.image. Be careful though, as this can save a lot of unnecessary (temporary) data.
save.image(file = '02-rstats-all.rda')
Tip: The best way to save your work is to save the script that contains the exact command that lead to the results! Or better, we can save and document our full analysis in an R markdown file!
Let’s assume that we have the population with a total of 10 subjects. Suppose we label them from 1 to 10 and randomly would like to select 3 subjects we can do this using the sample function. When we run sample another time, different subjects will be selected. Try this a couple times.
sample(10, 3)
## [1] 1 4 8
sample(10, 3)
## [1] 3 10 9
Now suppose we would like to select the same randomly selected samples every time, then we can use a random seed number.
set.seed(3728)
sample(10, 3)
## [1] 5 8 7
set.seed(3728)
sample(10, 3)
## [1] 5 8 7
Let’s practice with fun example. Select two in our group member for coming early next Monday.
group.member <- c('Meena', 'Tsung-Heng', 'Ting', 'April', 'Dan', 'Cyril', 'Kylie', 'Sara')
sample(group.member, 2)
## [1] "Dan" "Cyril"
We can also create a random order using all elements of iPRG dataset. Again, we can achieve this using sample, asking for exactly the amount of samples in the subset. This time, each repetition gives us a different order of the complete set.
msrun <- unique(iprg$Run)
msrun
## [1] JD_06232014_sample1_B.raw JD_06232014_sample1_C.raw
## [3] JD_06232014_sample1-A.raw JD_06232014_sample2_A.raw
## [5] JD_06232014_sample2_B.raw JD_06232014_sample2_C.raw
## [7] JD_06232014_sample3_A.raw JD_06232014_sample3_B.raw
## [9] JD_06232014_sample3_C.raw JD_06232014_sample4_B.raw
## [11] JD_06232014_sample4_C.raw JD_06232014_sample4-A.raw
## 12 Levels: JD_06232014_sample1_B.raw ... JD_06232014_sample4-A.raw
## randomize order among all 12 MS runs
sample(msrun, length(msrun))
## [1] JD_06232014_sample1_B.raw JD_06232014_sample3_C.raw
## [3] JD_06232014_sample4_B.raw JD_06232014_sample1_C.raw
## [5] JD_06232014_sample3_B.raw JD_06232014_sample4_C.raw
## [7] JD_06232014_sample3_A.raw JD_06232014_sample2_B.raw
## [9] JD_06232014_sample1-A.raw JD_06232014_sample2_C.raw
## [11] JD_06232014_sample4-A.raw JD_06232014_sample2_A.raw
## 12 Levels: JD_06232014_sample1_B.raw ... JD_06232014_sample4-A.raw
## different order will be shown.
sample(msrun, length(msrun))
## [1] JD_06232014_sample3_B.raw JD_06232014_sample2_A.raw
## [3] JD_06232014_sample3_A.raw JD_06232014_sample2_B.raw
## [5] JD_06232014_sample2_C.raw JD_06232014_sample3_C.raw
## [7] JD_06232014_sample4-A.raw JD_06232014_sample1-A.raw
## [9] JD_06232014_sample4_B.raw JD_06232014_sample1_C.raw
## [11] JD_06232014_sample1_B.raw JD_06232014_sample4_C.raw
## 12 Levels: JD_06232014_sample1_B.raw ... JD_06232014_sample4-A.raw
Allow to remove known sources of variability that you are not interested in.
Group conditions into blocks such that the conditions in a block are as similar as possible.
Randomly assign samples with a block.
This particular dataset contains a total of 12 MS runs across 4 conditions, 3 technical replicates per condition. Using the block.random function in the psych package, we can achieve randomized block designs! block.random function makes random assignment of n subjects with an equal number in all of N conditions.
library("psych") ## load the psych package
msrun <- unique(iprg[, c('Condition','Run')])
msrun
## Condition Run
## 1 Condition1 JD_06232014_sample1_B.raw
## 2 Condition1 JD_06232014_sample1_C.raw
## 3 Condition1 JD_06232014_sample1-A.raw
## 4 Condition2 JD_06232014_sample2_A.raw
## 5 Condition2 JD_06232014_sample2_B.raw
## 6 Condition2 JD_06232014_sample2_C.raw
## 7 Condition3 JD_06232014_sample3_A.raw
## 8 Condition3 JD_06232014_sample3_B.raw
## 9 Condition3 JD_06232014_sample3_C.raw
## 10 Condition4 JD_06232014_sample4_B.raw
## 11 Condition4 JD_06232014_sample4_C.raw
## 12 Condition4 JD_06232014_sample4-A.raw
## 4 Conditions of 12 MS runs randomly ordered
block.random(n = 12, c(Condition = 4))
## blocks Condition
## S1 1 4
## S2 1 2
## S3 1 3
## S4 1 1
## S5 2 2
## S6 2 1
## S7 2 3
## S8 2 4
## S9 3 4
## S10 3 3
## S11 3 2
## S12 3 1
block.random(n = 12, c(Condition = 4, BioReplicate=3))
## blocks Condition BioReplicate
## S1 1 4 2
## S2 1 4 1
## S3 1 3 1
## S4 1 1 1
## S5 1 1 2
## S6 1 3 3
## S7 1 2 1
## S8 1 2 3
## S9 1 4 3
## S10 1 2 2
## S11 1 1 3
## S12 1 3 2
Let’s start data with one protein as an example and calculate the mean, standard deviation, standard error of the mean across all replicates per condition. We then store all the computed statistics into a single summary data frame for easy access.
We can use the aggregate function to compute summary statistics. aggregate splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
# check what proteins are in dataset, show all protein names
head(unique(iprg$Protein))
## [1] sp|D6VTK4|STE2_YEAST sp|O13297|CET1_YEAST sp|O13329|FOB1_YEAST
## [4] sp|O13539|THP2_YEAST sp|O13547|CCW14_YEAST sp|O13563|RPN13_YEAST
## 3027 Levels: sp|D6VTK4|STE2_YEAST ... sp|Q9URQ5|HTL1_YEAST
# Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]
# there are 12 rows in oneproteindata
oneproteindata
## Protein Log2Intensity Run
## 21096 sp|P44015|VAC2_YEAST 26.30163 JD_06232014_sample1_B.raw
## 21097 sp|P44015|VAC2_YEAST 26.11643 JD_06232014_sample1_C.raw
## 21098 sp|P44015|VAC2_YEAST 26.29089 JD_06232014_sample1-A.raw
## 21099 sp|P44015|VAC2_YEAST 25.81957 JD_06232014_sample2_A.raw
## 21100 sp|P44015|VAC2_YEAST 26.11527 JD_06232014_sample2_B.raw
## 21101 sp|P44015|VAC2_YEAST 26.08498 JD_06232014_sample2_C.raw
## 21102 sp|P44015|VAC2_YEAST 23.14806 JD_06232014_sample3_A.raw
## 21103 sp|P44015|VAC2_YEAST 23.32465 JD_06232014_sample3_B.raw
## 21104 sp|P44015|VAC2_YEAST 23.29555 JD_06232014_sample3_C.raw
## 21105 sp|P44015|VAC2_YEAST 20.94536 JD_06232014_sample4_B.raw
## 21106 sp|P44015|VAC2_YEAST 21.71424 JD_06232014_sample4_C.raw
## 21107 sp|P44015|VAC2_YEAST 20.25209 JD_06232014_sample4-A.raw
## Condition BioReplicate Intensity TechReplicate
## 21096 Condition1 1 82714388 B
## 21097 Condition1 1 72749239 C
## 21098 Condition1 1 82100518 A
## 21099 Condition2 2 59219741 A
## 21100 Condition2 2 72690802 B
## 21101 Condition2 2 71180513 C
## 21102 Condition3 3 9295260 A
## 21103 Condition3 3 10505591 B
## 21104 Condition3 3 10295788 C
## 21105 Condition4 4 2019205 B
## 21106 Condition4 4 3440629 C
## 21107 Condition4 4 1248781 A
# If you want to see more details,
?aggregate
## splits 'oneproteindata' into subsets by 'Condition',
## then, compute 'FUN=mean' of 'log2Int'
sub.mean <- aggregate(Log2Intensity ~ Condition,
data = oneproteindata,
FUN = mean)
sub.mean
## Condition Log2Intensity
## 1 Condition1 26.23632
## 2 Condition2 26.00661
## 3 Condition3 23.25609
## 4 Condition4 20.97056
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2} \]
Challenge
Using the
aggregatefunction above, calculate the standard deviation, by applying themedianfunction.
## Condition Log2Intensity
## 1 Condition1 26.29089
## 2 Condition2 26.08498
## 3 Condition3 23.29555
## 4 Condition4 20.94536
Using the
aggregatefunction above, calculate the standard deviation, by applying thesdfunction.
## Condition Log2Intensity
## 1 Condition1 0.10396539
## 2 Condition2 0.16268179
## 3 Condition3 0.09467798
## 4 Condition4 0.73140174
Challenge
Using the
aggregatefunction above, count the number of observations per group with thelengthfunction.
## Condition Log2Intensity
## 1 Condition1 3
## 2 Condition2 3
## 3 Condition3 3
## 4 Condition4 3
\[ SE = \sqrt{\frac{s^2}{n}} \]
sub.se <- sqrt(sub.sd$Log2Intensity^2 / sub.len$Log2Intensity)
sub.se
## [1] 0.06002444 0.09392438 0.05466236 0.42227499
We can now make the summary table including the results above (mean, sd, se and length).
## paste0 : concatenate vectors after convering to character.
(grp <- paste0("Condition", 1:4))
## [1] "Condition1" "Condition2" "Condition3" "Condition4"
## It is equivalent to paste("Condition", 1:4, sep="")
summaryresult <- data.frame(Group = grp,
mean = sub.mean$Log2Intensity,
sd = sub.sd$Log2Intensity,
se = sub.se,
length = sub.len$Log2Intensity)
summaryresult
## Group mean sd se length
## 1 Condition1 26.23632 0.10396539 0.06002444 3
## 2 Condition2 26.00661 0.16268179 0.09392438 3
## 3 Condition3 23.25609 0.09467798 0.05466236 3
## 4 Condition4 20.97056 0.73140174 0.42227499 3
error bars can have a variety of meanings or conclusions if what they represent is not precisely specified. Below we provide some examples of which types of error bars are common. We’re using the summary of protein sp|P44015|VAC2_YEAST from the previous section and the ggplot2 package as it provides a convenient way to make easily adaptable plots.
# means without any errorbar
p <- ggplot(aes(x = Group, y = mean, colour = Group),
data = summaryresult)+
geom_point(size = 3)
p
Let’s change a number of visual properties to make the plot more attractive.
labs(title="Mean", x="Condition", y='Log2(Intensity)')theme_bw()'none' to remove it)p2 <- p + labs(title = "Mean", x = "Group", y = 'Log2(Intensity)') +
theme_bw() +
theme(plot.title = element_text(size = 25, colour = "darkblue"),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 13),
legend.position = 'bottom',
legend.background = element_rect(colour = 'black'),
legend.key = element_rect(colour = 'white'))
p2
Let’s now add the standard deviation:
# mean with SD
p2 + geom_errorbar(aes(ymax = mean + sd, ymin = mean - sd), width = 0.1) +
labs(title="Mean with SD")
Challenge
Add the standard error of the mean. Which one is smaller?
Challenge
Add the standard error of the mean with black color.
For each statistical distribution, we have function to compute
For the normale distribution norm, these are respectively
dnormpnormqnormrnormLet’s start by sampling 1000000 values from a normal distribution \(N(0, 1)\):
xn <- rnorm(1e6)
hist(xn, freq = FALSE)
rug(xn)
lines(density(xn), lwd = 2)
By definition, the area under the density curve is 1. The area at the left of 0, 1, and 2 are respectively:
pnorm(0)
## [1] 0.5
pnorm(1)
## [1] 0.8413447
pnorm(2)
## [1] 0.9772499
To ask the inverse question, we use the quantile function. The obtain 0.5, 0.8413447 and 0.9772499 of our distribution, we need means of:
qnorm(0.5)
## [1] 0
qnorm(pnorm(1))
## [1] 1
qnorm(pnorm(2))
## [1] 2
Finally, the density function gives us the height at which we are for a given mean:
hist(xn, freq = FALSE)
lines(density(xn), lwd = 2)
points(0, dnorm(0), pch = 19, col = "red")
points(1, dnorm(1), pch = 1, col = "blue")
points(2, dnorm(2), pch = 4, col = "orange")
Now that we’ve covered the standard error of the mean and the standard deviation, let’s investigate how we can add custom confidence intervals (CI) for our measurement of the mean. We’ll add these CI’s to the summary results we previously stored for protein sp|P44015|VAC2_YEAST.
Confidence interval:
\[\mbox{mean} \pm (SE \times \frac{\alpha}{2} ~ \mbox{quantile of t distribution})\]
To calculate the 95% confident interval, we need to be careful and set the quantile for two-sided. We need to divide by two for error. For example, 95% confidence interval, right tail is 2.5% and left tail is 2.5%.
summaryresult$ciw.lower.95 <- summaryresult$mean -
qt(0.975, summaryresult$len - 1) * summaryresult$se
summaryresult$ciw.upper.95 <- summaryresult$mean +
qt(0.975, summaryresult$len - 1) * summaryresult$se
summaryresult
## Group mean sd se length ciw.lower.95
## 1 Condition1 26.23632 0.10396539 0.06002444 3 25.97805
## 2 Condition2 26.00661 0.16268179 0.09392438 3 25.60248
## 3 Condition3 23.25609 0.09467798 0.05466236 3 23.02090
## 4 Condition4 20.97056 0.73140174 0.42227499 3 19.15366
## ciw.upper.95
## 1 26.49458
## 2 26.41073
## 3 23.49128
## 4 22.78746
# mean with 95% two-sided confidence interval
ggplot(aes(x = Group, y = mean, colour = Group),
data = summaryresult) +
geom_point() +
geom_errorbar(aes(ymax = ciw.upper.95, ymin = ciw.lower.95), width = 0.1) +
labs(title="Mean with 95% confidence interval", x="Condition", y='Log2(Intensity)') +
theme_bw() +
theme(plot.title = element_text(size=25, colour="darkblue"),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
axis.text.x = element_text(size=13),
legend.position = 'bottom',
legend.background = element_rect(colour = 'black'),
legend.key = element_rect(colour='white'))
Challenges
Replicate the above for the 99% two-sided confidence interval.
Error bars with SD and CI are overlapping between groups!
Error bars for the SD show the spread of the population while error bars based on SE reflect the uncertainty in the mean and depend on the sample size.
Confidence intervals of n on the other hand mean that the intervals capture the population mean n percent of the time.
When the sample size increases, CI and SE are getting closer to each other.
We have two objects that contain all the information that we have generated so far:
summaryresults object, that contains all the summary statistics.iprg data frame, that was read from the csv file. This object can be easily regenerated using read.csv, and hence doesn’t necessarily to be saved explicity.save(summaryresult, file = "./data/summaryresults.rda")
save(iprg, file = "./data/iprg.rda")
We can also save the summary result as a csv file using the write.csv function:
write.csv(sumamryresult, file = "./data/summary.csv")
Tip: Exporting to csv is useful to share your work with collaborators that do not use R, but for wany continous work in R, to assure data validity accords platforms, the best format is rda.
Back to course home page